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Many complex systems in nature and society can be described in terms of networks capturing 
the intricate web of connections among the units they are made of | l]|2j|3j|4|. A question of interest 
is how to interpret the global organisation of such networks as the coexistence of their structural 
sub-units (communities) associated with more highly interconnected parts. Identifying these a pri- 
ori unknown building blocks (functionally related proteins 15 , 6 1, industrial sectors |7|, groups of 
people |8, 9 1, etc.) is crucial to the understanding of the structural and functional properties of 
networks. The existing deterministic methods used for large networks find separated communi- 
ties, while most of the actual networks are made of highly overlapping cohesive groups of nodes. 
Here we introduce an approach to analyse the main statistical features of the interwoven sets of 
overlapping communities making a step towards the uncovering of the modular structure of com- 
plex systems. After defining a set of new characteristic quantities for the statistics of communities, 
we apply an efficient technique to explore overlapping communities on a large scale. We find that 
overlaps are significant, and the distributions we introduce reveal universal features of networks. 
Our studies of collaboration, word association, and protein interaction graphs demonstrate that 
the web of communities has non-trivial correlations and specific scaling properties. 

Most real networks typically contain parts in which the nodes (units) are more highly connected to 
each other than to the rest of the network. The sets of such nodes are usually called clusters, communities, 
cohesive groups, or modules mEBIT^ElEl. having no widely accepted, unique definition. In spite 
of this ambiguity the presence of communities in networks is a signature of the hierarchical nature of 
complex systems 10 [T4). The existing methods for finding communities in large networks are useful if 
the community structure is such that it can be interpreted in terms of separated sets of communities (see 
Fig-ID 5 an d Refs. lITT^ IT31 fTTH [FTl [T8K However, most real networks are characterised by well defined 
statistics of overlapping and nested communities. Such a statement can be demonstrated by the numerous 
communities each of us belongs to, including those related to our scientific activities or personal life 
(school, hobby, family) and so on, as illustrated in Fig. [^. Furthermore, members of our communities 
have their own communities, resulting in an extremely complicated web of the communities themselves. 
This has long been appreciated by sociologists |[T9l . but has never been studied systematically for large 
networks. Another, biological example is that a large fraction of proteins belong to several protein 
complexes simultaneously [20 1. 

In general, each node i of a network can be characterised by a membership number m;, which is 
the number of communities the node belongs to. In turn, any two communities a and (5 can share s° v ^ 
nodes, which we define as the overlap size between these communities. Naturally, the communities 
also constitute a network with the overlaps being their links. The number of such links of community 
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Figure 1: Illustration of the concept of overlapping communities, a) The black dot in the middle rep- 
resents either of the authors of this Letter, with several of his communities around. Zooming into the 
scientific community demonstrates the nested and overlapping structure of the communities, while de- 
picting the cascades of communities starting from some members exemplifies the interwoven structure 
of the network of communities, b) Divisive and agglomerative methods grossly fail to identify the com- 
munities when overlaps are significant, c) An example of overlapping /c-clique-communities at k = 4. 
The yellow community overlaps with the blue one in a single node, whereas it shares two nodes and 
a link with the green one. These overlapping regions are emphasised in red. Notice that any /c-clique 
(complete subgraph of size k) can be reached only from the fc-cliques of the same community through a 
series of adjacent fc-cliques. Two fc-cliques are adjacent if they share k — 1 nodes. 

a can be called as its community degree, d c £ m . Finally, the size s c ° m of any community a can most 
naturally be defined as the number of its nodes. To characterise the community structure of a large 
network we introduce the distributions of these four basic quantities. In particular, we will focus on their 
cumulative distribution functions denoted by P(s com ), P(d com ), P(s ov ), and P(m), respectively. For 
the overlap size, e.g., P(s ov ) means the proportion of those overlaps that are larger than s ov . Further 
relevant statistical features will be introduced later. 

The basic observation on which our community definition relies is that a typical community consists 
of several complete (fully connected) subgraphs that tend to share many of their nodes. Thus, we define a 
community, or more precisely, a k-clique-community as a union of all k-cliques (complete subgraphs of 
size k) that can be reached from each other through a series of adjacent /c-cliques (where adjacency means 
sharing k — 1 nodes) l2Tll23ll24l . This definition is aimed at representing the fact that it is an essential 
feature of a community that its members can be reached through well connected subsets of nodes. There 
are other parts of the whole network that are not reachable from a particular A;-clique, but they potentially 
contain further /c-clique-communities. In turn, a single node can belong to several communities. All 
these can be explored systematically and can result in a large number of overlapping communities (see 
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Fig. ^ for illustration). Note that in most cases relaxing this definition (e.g., by allowing incomplete 
fc-cliques) is practically equivalent to lowering the value of k. For finding meaningful communities, the 
way they are identified is expected to satisfy several basic requirements: it cannot be too restrictive, 
should be based on the density of links, is required to be local, should not yield any cut-node or cut- 
link (whose removal would disjoin the community) and, of course, should allow overlaps. We employ 
the community definition specified above, because none of the others in the literature satisfy all these 
requirements simultaneously I2T1I22I . 

Although the numerical determination of the full set of /c-clique-communities is a polynomial prob- 
lem, we use an algorithm which is exponential, because it is significantly more efficient for the graphs 
corresponding to actual data. This method is based on first locating all cliques (maximal complete 
subgraphs) of the network and then identifying the communities by carrying out a standard component 
analysis of the clique-clique overlap matrix 1211 . For more details about the method and its speed see the 
Supplementary Information. 

We use our method for binary networks (i.e., with undirected and unweighted links). An arbitrary 
network can always be transformed into a binary one by ignoring any directionality in the links and 
keeping only those that are stronger than a threshold weight w* . Changing the threshold is like changing 
the resolution (as in a microscope) with which the community structure is investigated: by increasing 
w* the communities start to shrink and fall apart. A very similar effect can be observed by changing the 
value of k as well: increasing k makes the communities smaller and more disintegrated, but at the same 
time, also more cohesive. 

When we are interested in the community structure around a particular node, it is advisable to scan 
through some ranges of k and w*, and monitor how its communities change. As an illustration, in Fig.|2] 
we are depicting the communities of three selected nodes of three large networks: (i) the social network 
of scientific collaborators l25l . (ii) the network of word associations ll26l related to cognitive sciences, 
and (iii) the molecular-biological network of protein-protein interactions E71 . These pictures can serve 
as tests or validations of the efficiency of our algorithm. In particular, the communities of the author G. 
Parisi (who is well known to have been making significant contributions in different fields of physics) 
shown in Fig.|2k are associated with his fields of interest, as it can be deduced from the titles of the papers 
involved. The 4-clique-communities of the word "bright" (Fig. I2J)) correspond to the various meanings 
of this word. An important biological application is finding the communities of proteins, based on their 
interactions. Indeed, most proteins in the communities shown in Figs. EJ; and |3] can be associated with 
either protein complexes or certain functions, as can be looked up by using the GO-TermFinder package 
[28 1 and the online tools of the Saccharomyces Genome Database (SGD) [29|. For some proteins no 
function is available yet. Thus, the fact that they show up in our approach as members of communities 
can be interpreted as a prediction for their functions. One such example can be seen in the enlarged 
portion of Fig. |3] For the protein Ycr072c, which is required for the viability of the cell and appears in 
the dark green community on the right, SGD provides no biological process (function). By far the most 
significant GO term for the biological process of this community is "ribosome biogenesis/assembly". 
Thus, we can infer that Ycr072c is likely to be involved in this process. Also, new cellular processes can 
be predicted if yet unknown communities are found with our method. 

These examples (and further ones included in the Supplementary Information) clearly demonstrate 
the advantages of our approach over the existing divisive and agglomerative methods recently used for 
large real networks. Divisive methods cut the network into smaller and smaller pieces, each node is 
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Figure 2: The community structure around a particular node in three different networks. The commu- 
nities are colour coded, the overlapping nodes and links between them are emphasised in red, and the 
volume of the balls and the width of the links are proportional to the total number of communities they 
belong to. For each network the value of k has been set to 4. a) The communities of G. Parisi in the 
co-authorship network of the Los Alamos cond-mat archive (for threshold weight w* = 0.75) can be 
associated with his fields of interest, b) The communities of the word "bright" in the South Florida Free 
Association norms list (for w* = 0.025) represent the different meanings of this word, c) The commu- 
nities of the protein ZDS 1 in the DIP core list of the protein-protein interactions of S. cerevisiae can be 
associated with either protein complexes or certain functions. 

forced to remain in only one community and be separated from its other communities, most of which 
then necessarily fall apart and disappear. This happens, e.g., with the word "bright" when we apply the 
method described in Ref . fi"6l : it tends to stay together mostly with the words of the community related 
to "light", while most of its other communities (e.g., those related to "colors", see Fig. EJd) completely 
disintegrate ("green" gets to the vegetables, "orange" to the fruits, etc.). Agglomerative methods do the 
same, but in the reverse direction. For example, when we applied the agglomerative method of Ref. [ 18 1, 
at some point "bright", as a single word, joined a "community" of 890 other words. In addition, such 
methods inevitably lead to a tree-like hierarchical rendering of the communities, while our approach 
allows the construction of an unconstrained network of communities. 

The networks chosen above have been constructed in the following ways. In the co-authorship net- 
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work of the Los Alamos e-print archives [25] each article contributes the value l/(n — 1) to the weight 
of the link between every pair of its n authors. In the South Florida Free Association norms list [26| 
the weight of a directed link from one word to another indicates the frequency that the people in the 
survey associated the end point of the link with its start point. For our purposes these directed links have 
been replaced by undirected ones with a weight equal to the sum of the weights of the corresponding 
two oppositely directed links. In the DIP core list of the protein-protein interactions of S. cerevisiae [ 27 1 
each interaction represents an unweighted link between the interacting proteins. These networks are very 
large, consisting of 30739, 10617, and 2609 nodes and 136065, 63788, and 6355 links, respectively. 

Although different values of k and w* might be optimal for the local community structure around 
different nodes, we should set some global criterion to fix their values if we want to analyse the statistical 
properties of the community structure of the entire network. The criterion we use is based on finding 
a community structure as highly structured as possible. In the related percolation phenomena 1241 a 
giant component appears when the number of links is increased above some critical point. Therefore, 
to approach this critical point from below, for each selected value of k (typically between 3 and 6) we 
lower the threshold w* until the largest community becomes twice as big as the second largest one. In 
this way we ensure that we find as many communities as possible, without the negative effect of having a 
giant community that would smear out the details of the community structure by merging many smaller 
communities. We denote by /* the fraction of links stronger than w* , and use only those values of k for 
which /* is not too small (not smaller than 0.5). This has led us to k = 6 and 5 with /* = 0.93 and 
0.75, respectively, for the collaboration network, and k = 4 with /* = 0.67 for the word association 
network. For the former network both sets of parameters result in very similar communities (see the 
Supplementary Information). Since for unweighted networks no threshold weight can be set, for these 
we simply select the smallest value of k for which no giant community appears. In case of the protein 
interaction network this gives k = 4, resulting in 82 communities. Due to this relatively low number, we 
can depict the entire network of protein communities as presented in Fig. |3] 

The four distributions characterising the global community structure of these networks are displayed 
in Fig. H] Although the scaling of the size of non-overlapping communities has already been demon- 
strated for social networks ffSl IP71 . it is striking to observe how this aspect of large real networks is 
preserved even when a more complete picture (allowing overlaps) is investigated. In Fig. @t the power 
law dependence P(s com ) ~ (s com )~ T with an exponent ranging between r = 1 and 1.6 is well pro- 
nounced and is valid nearly over the entire range of community sizes. 

It is well known EJ |3] El that the nodes of large real networks have a power law degree distri- 
bution. Will the same kind of distribution hold when we move to the next level of organisation and 
consider the degrees of the communities? Remarkably, we find that it is not the case. The community 
degrees (Fig. @J)) have a very unique distribution, consisting of two distinct parts: an exponential decay 
P(d com ) ~ exp(— d com /dQ° m ) with a characteristic community degree dQ° m (which is in the order of 
(d com ) shown in Table followed by a power law tail ~ (d com )~ r . This new kind of behaviour is 
consistent with the community size distribution assuming that on average each node of a community has 
a contribution 5 to the community degree. The tail of the community degree distribution is, therefore, 
simply proportional to that of the community size distribution. At the first part of P(<i com ), on the other 
hand, a characteristic scale c?Q° m ~ k5 appears, because the majority of the communities have a size of 
the order of k (see Fig. |4^) and their distribution around d^ m dominates this part of the curve. Thus, 
the degree to which P(d com ) deviates from a simple scaling depends on k or, in other words, on the 
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Figure 3: Network of the 82 communities in the DIP core list of the protein-protein interactions of 5. 
cerevisiae for k = 4. The area of the circles and the width of the links are proportional to the size of 
the corresponding communities (s™ m ) and to the size of the overlaps (s° v A respectively. The coloured 
communities are cut out and magnified to reveal their internal structure. In this magnified picture the 
nodes and links of the original network have the same colour as their communities, those that are shared 
by more than one community are emphasised in red, and the grey links are not part of these communities. 
The area of the circles and the width of the links are proportional to the total number of communities 
they belong to. 

prescribed minimum cohesiveness of the communities. 

The extent to which different communities overlap is also a relevant property of a network. Although 
the range of overlap sizes is limited, the behaviour of the cumulative overlap size distribution P(s ov ), 
shown in Fig.Ht, is close to a power law for each network, with a rather large exponent. We can conclude 
that there is no characteristic overlap size in the networks. Finally, in Fig.@Jl we display the cumulative 
distribution of the membership number, P(m). These plots demonstrate that a node may belong to a 
number of communities. In the collaboration and the word association network there seems to be no 
characteristic value for the membership number: the data are close to a power law dependence, with 
a large exponent. In the protein interaction network, however, the largest membership number is only 
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Figure 4: Statistics of the /c-clique-communities for three large networks. These are the co-authorship 
network of the Los Alamos cond-mat archive (triangles, k=6, f* = 0.93), the word association network 
of the South Florida Free Association norms (squares, k = 4, /* = 0.67), and the protein interaction 
network of the yeast 5. cerevisiae from the DIP database (circles, k = 4). (a) The cumulative distribution 
function of the community size follows a power law with exponents between —1 (upper line) and —1.6 
(lower line), (b) The cumulative distribution of the community degree starts exponentially and then 
crosses over to a power law (with the same exponent as for the community size distribution). Plot (c) is 
the cumulative distribution of the overlap size and (d) is that of the membership number. 



4, which is consistent with the also rather short distribution of its community degree. To show that the 
communities we find are not due to some sort of artifact of our method, we have also determined the 
above distributions for "randomised" graphs with parameters (size, degree sequence, k and /*) being the 
same as those of our three examples, but with links stochastically redistributed among the nodes. We 
have found that indeed the distributions are extremely truncated, signifying a complete lack of the rich 
community structure determined for the original data. 

In Table [2 we have collected a few interesting statistical properties of the network of communities. 
It should be pointed out that the average clustering coefficients (C com ) are relatively high, indicating 
that two communities overlapping with a given community are likely to overlap with each other as well, 
mostly because they all share the same overlapping region. The high fraction of shared nodes is yet 
another indication of the importance of overlaps between the communities. 

The specific scaling of the community degree distribution is a novel signature of the hierarchical 
nature of the systems we study. We find that if we consider the network of communities instead of the 
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Table 1: Statistical properties of the network of communities. N com is the number of communities, 
(d com ) is the average community degree, (C com ) is the average clustering coefficient of the network of 
communities, and (r) represents the average fraction of shared nodes in the communities. 

nodes themselves, we still observe a degree distribution with a fat tail, but a characteristic scale appears, 
below which the distribution is exponential. This is consistent with our understanding of a complex 
system having different levels of organisation with units specific to each level. In the present case the 
principle of organisation (scaling) is preserved (with some specific modifications) when going to the next 
level in good agreement with the recent finding of the self-similarity of many complex networks [30]. 

With recent technological advances, huge sets of data are accumulating at a tremendous pace in vari- 
ous fields of human activity (including telecommunication, the internet, stock markets) and in many areas 
of life and social sciences (biomolecular assays, genetic maps, groups of web users, etc.). Understanding 
both the universal and specific features of the networks associated with these data has become an actual 
and important task. The knowledge of the community structure enables the prediction of some essential 
features of the systems under investigation. For example, since with our approach it is possible to "zoom" 
onto a single unit in a network and uncover its communities (and the communities connected to these, 
and so on), we provide a tool to interpret the local organisation of large networks and can predict how 
the modular structure of the network changes if a unit is removed (e.g., in a gene knock out experiment). 
A unique feature of our method is that we can simultaneously look at the network at a higher level of 
organisation and locate the communities that play a key role within the web of communities. Among the 
many possible applications is a more sophisticated approach to the spreading of infections (e.g., real or 
computer viruses) or information in highly modular complex systems. 
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